This PR integrates the Nanochat language model stack into Plato, enabling federated learning experiments with GPT-style models such as Nanochat. It adds Nanochat as a Git submodule and introduces the CORE benchmark for language model evaluation.

Description

Third-party submodule and model integration:

  • Nanochat submodule (external/nanochat): Git submodule integration of karpathy/nanochat.
  • Model factory (plato/models/nanochat.py): Nanochat model with configurable architecture parameters, checkpoint loading, and automatic tokenizer attachment.
  • Registry integration (plato/models/registry.py): Registered the "Nanochat" model type for configuration-based instantiation (see the sketch after this list).
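
The registry hookup means a model can be built purely from configuration. As a rough illustration (the class and registry names below are hypothetical, not the exact code in this PR, and the GPTConfig fields should be checked against the submodule's actual schema):

# Hypothetical sketch -- `NanochatModel` and `registered_models` are
# illustrative names, not Plato's exact API.
import torch
from nanochat.gpt import GPT, GPTConfig  # provided by external/nanochat

class NanochatModel:
    @staticmethod
    def get(n_layer=12, n_head=6, n_embd=384, checkpoint_path=None):
        model = GPT(GPTConfig(n_layer=n_layer, n_head=n_head, n_embd=n_embd))
        if checkpoint_path is not None:  # optional checkpoint loading
            model.load_state_dict(torch.load(checkpoint_path, map_location="cpu"))
        return model

registered_models = {"nanochat": NanochatModel}  # looked up by the configured model type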

Tokenizer and data processing:

  • Rust tokenizer processor (plato/processors/nanochat_tokenizer.py): Wrapper for rustbpe + tiktoken stack with special token support and corpus training capabilities.
  • Streaming datasource (plato/datasources/nanochat.py): Configurable datasource supporting both real parquet data and synthetic token generation, with an automatic fallback to the latter (sketched after this list).
  • Registry integration (plato/datasources/registry.py): "Nanochat" datasource registered for TOML configuration.
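
The automatic fallback in the datasource is what makes the synthetic configurations runnable without downloading any data. A minimal sketch of that pattern, with illustrative names rather than the PR's exact class:

# Hypothetical sketch of the parquet-or-synthetic fallback; names and
# parameters are illustrative only.
import torch

class NanochatDataSource:
    """Parquet-backed token datasource with a synthetic fallback."""

    def __init__(self, parquet_path=None, vocab_size=50304,
                 seq_len=1024, num_samples=256):
        self.tokens = None
        if parquet_path is not None:
            try:
                import pyarrow.parquet as pq
                self.tokens = pq.read_table(parquet_path)["tokens"].to_pylist()
            except (ImportError, OSError):
                self.tokens = None  # fall through to the synthetic path
        if self.tokens is None:
            # Synthetic fallback: uniformly random token ids of the right shape.
            self.tokens = torch.randint(vocab_size, (num_samples, seq_len))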

Training infrastructure:

  • Composable trainer (plato/trainers/nanochat.py): Specialized trainer with Nanochat-specific data loading, training steps, and optimizer strategies.
  • Multiple training strategies: Custom data loader, training step, optimizer, and testing strategies tailored for Nanochat models (a training-step sketch follows this list).
  • CORE evaluation integration: Built-in support for nanochat_core evaluation type with automatic benchmark execution.
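
The core of a causal language-model training step looks roughly like the sketch below. The function signature is illustrative, not the trainer's actual strategy interface, and nanochat's GPT may compute the loss internally when given targets:

# Illustrative training-step strategy for a GPT-style model.
import torch.nn.functional as F

def nanochat_training_step(model, optimizer, tokens):
    # tokens: (batch, seq_len) integer ids; shift by one for next-token targets
    inputs, targets = tokens[:, :-1], tokens[:, 1:]
    logits = model(inputs)  # (batch, seq_len - 1, vocab_size)
    loss = F.cross_entropy(
        logits.reshape(-1, logits.size(-1)),  # .reshape, not .view -- see the commits below
        targets.reshape(-1),
    )
    optimizer.zero_grad(set_to_none=True)
    loss.backward()
    optimizer.step()
    return loss.item()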

Evaluation framework:

  • CORE benchmark adapter (plato/evaluators/nanochat_core.py): Complete port of nanochat/core_eval.py with automatic bundle download, task loading, and metric computation.
  • Comprehensive evaluation: Support for 22 language model evaluation tasks with centered accuracy metrics (see the sketch after this list).
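
The "centered" column in the test log below follows the CORE convention of rescaling raw accuracy so that a task's random-guess baseline maps to 0 and perfect accuracy maps to 1 (the helper name here is illustrative):

def centered_accuracy(accuracy, baseline):
    # baseline: the task's random-guess accuracy, e.g. 0.25 for 4-way choice
    return (accuracy - baseline) / (1.0 - baseline)

# Matches the test log: arc_challenge at 0.1875 raw accuracy with a
# four-choice baseline of 0.25 yields the logged centered score of -0.0833.
assert round(centered_accuracy(0.1875, 0.25), 4) == -0.0833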

Configuration and examples:

  • Configuration templates (plato/configs/Nanochat/): Ready-to-use synthetic and extended evaluation configurations (an illustrative snippet follows this list).
  • Example workspace (plato/examples/nanochat/): Documentation, setup instructions, and quickstart guides.
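
For orientation, a configuration for this stack would tie the registered pieces together along these lines. The section and key names are illustrative only; see configs/Nanochat/ for the actual schema:

# Hypothetical TOML sketch, not the exact contents of synthetic_micro.toml
[data]
datasource = "Nanochat"

[trainer]
type = "nanochat"
rounds = 1

[parameters.model]
type = "nanochat"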

How has this been tested?

Tested the CORE benchmark evaluation with the configuration file synthetic_micro.toml.

Command:

uv run --extra nanochat python plato.py --config configs/Nanochat/synthetic_micro.toml

Output showing a successful CORE benchmark evaluation on all 22 tasks after one round of federated training:

[INFO][23:22:05]: [Server #27388] Started model testing.
[INFO][23:23:10]: CORE task hellaswag_zeroshot | accuracy 0.2500 | centered 0.0000 | 64.92s
[INFO][23:23:38]: CORE task jeopardy | accuracy 0.0000 | centered 0.0000 | 27.83s
[INFO][23:24:03]: CORE task bigbench_qa_wikidata | accuracy 0.0000 | centered 0.0000 | 25.45s
[INFO][23:25:09]: CORE task arc_easy | accuracy 0.2500 | centered 0.0000 | 65.78s
[INFO][23:26:13]: CORE task arc_challenge | accuracy 0.1875 | centered -0.0833 | 63.81s
[INFO][23:26:32]: CORE task copa | accuracy 0.7500 | centered 0.5000 | 19.09s
[INFO][23:27:35]: CORE task commonsense_qa | accuracy 0.1250 | centered -0.0938 | 63.70s
[INFO][23:28:18]: CORE task piqa | accuracy 0.5000 | centered 0.0000 | 42.78s
[INFO][23:28:38]: CORE task openbook_qa | accuracy 0.5000 | centered 0.3333 | 20.04s
[INFO][23:28:59]: CORE task lambada_openai | accuracy 0.0000 | centered 0.0000 | 21.03s
[INFO][23:30:02]: CORE task hellaswag | accuracy 0.2500 | centered 0.0000 | 62.35s
[INFO][23:30:23]: CORE task winograd | accuracy 0.5000 | centered 0.0000 | 20.95s
[INFO][23:30:44]: CORE task winogrande | accuracy 0.6250 | centered 0.2500 | 21.36s
[INFO][23:31:12]: CORE task bigbench_dyck_languages | accuracy 0.0000 | centered 0.0000 | 27.69s
[INFO][23:32:16]: CORE task agi_eval_lsat_ar | accuracy 0.3750 | centered 0.2187 | 63.91s
[INFO][23:32:43]: CORE task bigbench_cs_algorithms | accuracy 0.0000 | centered 0.0000 | 27.70s
[INFO][23:33:11]: CORE task bigbench_operators | accuracy 0.0000 | centered 0.0000 | 28.04s
[INFO][23:33:39]: CORE task bigbench_repeat_copy_logic | accuracy 0.0000 | centered 0.0000 | 27.48s
[INFO][23:34:06]: CORE task squad | accuracy 0.0000 | centered 0.0000 | 27.23s
[INFO][23:34:34]: CORE task coqa | accuracy 0.0000 | centered 0.0000 | 27.86s
[INFO][23:35:18]: CORE task boolq | accuracy 0.4375 | centered -0.4803 | 43.64s
[INFO][23:36:25]: CORE task bigbench_language_identification | accuracy 0.0625 | centered -0.0314 | 67.52s
[INFO][23:36:25]: [Server #27388] Average Centered CORE benchmark metric: 2.79%
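
As a sanity check, the reported 2.79% is the plain arithmetic mean of the 22 per-task centered scores above:

# Recomputing the average from the per-task centered scores in the log
centered = [0.0, 0.0, 0.0, 0.0, -0.0833, 0.5, -0.0938, 0.0, 0.3333, 0.0,
            0.0, 0.0, 0.25, 0.0, 0.2187, 0.0, 0.0, 0.0, 0.0, 0.0,
            -0.4803, -0.0314]
print(f"{100 * sum(centered) / len(centered):.2f}%")  # -> 2.79%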

Types of changes

  • Bug fix (non-breaking change which fixes an issue) Fixes #
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)

Checklist:

  • My code has been formatted using the Ruff formatter (ruff format) and checked using the Ruff linter (ruff check --fix).
  • My change requires a change to the documentation.
  • I have updated the documentation accordingly.

baochunli and others added 16 commits (October 28, 2025). Notable fixes include:

  • Resolved a RuntimeError caused by non-contiguous tensors during view operations in nanochat's gpt.py ("view size is not compatible with input tensor's size and stride..."): replaced .view() with .reshape() (see the minimal reproduction after this list).
  • Resolved an issue where the configuration requested 'train_loss' in the results, but the server's get_logged_items() did not include it.
  • Avoided a vocabulary size mismatch between the model and the tokenizer during CORE evaluation.
  • Updated the log message from "global accuracy" to "Average Centered CORE benchmark metric".
  • Formatted the code using Ruff.
  • Added instructions for initializing submodules and resolving a maturin build failure.
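
The first fix above is easy to reproduce in isolation: .view() requires a contiguous tensor, while .reshape() falls back to a copy when needed:

# Minimal repro of the .view()/.reshape() behavior fixed in gpt.py
import torch

x = torch.randn(4, 6).transpose(0, 1)  # transposing makes x non-contiguous
print(x.reshape(-1).shape)             # works: reshape copies if it must
try:
    x.view(-1)                         # raises the RuntimeError quoted above
except RuntimeError as err:
    print(err)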